fix: address unresolved review comments from PyPDF File Processor PR#4743#5173
Conversation
Hi @RobuRishabh! Thank you for your pull request and welcome to our community.

Action Required: In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process: In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA. Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged.

If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks!
Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!
…lamastack#4743

- Remove legacy chunking fallback and `_legacy_chunk_file` from vector store mixin; raise `RuntimeError` if FileProcessor API is not configured
- Wire `file_processor_api` through all vector_io providers (registry, factories, adapter constructors)
- Make `files_api` required in PyPDF adapter and processor constructors
- Implement chunked file reading (64KB) for direct uploads to cap memory usage
- Add size check on `file_id` retrieval path against `max_file_size_bytes`
- Wrap `openai_retrieve_file` in try/except to surface a clear `ValueError` for a missing `file_id`, with test coverage
- Make the `.strip()` page filter conditional on the `clean_text` config
- Remove unused `file_processor_api` field from `VectorStoreWithIndex`
- Clean up dead imports (`make_overlapped_chunks`) from the mixin
- Fix linters and formatting using pre-commit checks
- Fix pypdf to handle .txt files

Signed-off-by: roburishabh <roburishabh@outlook.com>
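The 64KB chunked reading mentioned above can be sketched as follows (a minimal illustration, not the adapter's actual code; the helper name `read_in_chunks` and where the size cap is enforced are assumptions):

```python
import hashlib
import io

CHUNK_SIZE = 64 * 1024  # 64KB per read caps peak memory for large uploads


def read_in_chunks(stream, chunk_size=CHUNK_SIZE, max_bytes=None):
    """Yield fixed-size chunks from a binary stream, enforcing an optional size cap."""
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        total += len(chunk)
        if max_bytes is not None and total > max_bytes:
            raise ValueError(f"File exceeds max_file_size_bytes ({max_bytes})")
        yield chunk


# Example: hash a stream without ever holding the whole file in memory
digest = hashlib.sha256()
for chunk in read_in_chunks(io.BytesIO(b"x" * 200_000), max_bytes=1_000_000):
    digest.update(chunk)
```

The point of the cap inside the loop is that an oversized upload fails as soon as the limit is crossed, rather than after the whole file has been buffered.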
43b1105 to 1eb5352
cdoern left a comment
I think you are missing additions to `VectorIORouter`, so all the tests are failing because the args to the router are mismatched.
…s-Unresolved-Reviews
Signed-off-by: roburishabh <roburishabh@outlook.com>
…s-Unresolved-Reviews
…hub.com/RobuRishabh/llama-stack into RHAIENG-1823-Address-Unresolved-Reviews
…s-Unresolved-Reviews
- Add MIME type parsing safety check to prevent `IndexError`
- Document chunked file reading approach and rationale
- Make file_processors a hard dependency for all vector_io providers
- Add unit test for missing `file_processor_api` error handling

Signed-off-by: roburishabh <roburishabh@outlook.com>
This pull request has merge conflicts that must be resolved before it can be merged. @RobuRishabh please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork
Remove duplicate legacy chunking code that was incorrectly merged alongside the new FileProcessor API path, and fix incomplete `RuntimeError` syntax. Also remove the unused `make_overlapped_chunks` import.

Signed-off-by: roburishabh <roburishabh@outlook.com>
Co-Authored-By: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
✅ Recordings committed successfully. Recordings from the integration tests have been committed to this PR.
The replay failures are not related to this PR, but they block it from merging. All file_search tests that need to be re-recorded fail during replay because request hashes don't match the recordings. From the server log:

RuntimeError: Recording not found for request hash: bfa38c097b0589ca...

The file_search tests generate new vector store IDs, file IDs, and chunk scores on each run. These end up in the chat completion request body, changing the hash, so the replay system can't find the matching recording. The API recorder needs to normalize these non-deterministic fields before hashing for this PR to pass CI.
That is my best guess, at least.
… hashing Signed-off-by: roburishabh <roburishabh@outlook.com>
…s-Unresolved-Reviews
…hub.com/RobuRishabh/llama-stack into RHAIENG-1823-Address-Unresolved-Reviews
…s-Unresolved-Reviews
…hub.com/RobuRishabh/llama-stack into RHAIENG-1823-Address-Unresolved-Reviews
Signed-off-by: roburishabh <roburishabh@outlook.com>
…l results Signed-off-by: roburishabh <roburishabh@outlook.com>
…s-Unresolved-Reviews
Co-Authored-By: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
…ool results

The patterns match UUID format (8-4-4-4-12 hex digits), as document IDs are UUIDs, not file IDs, in the actual tool results.

- Updated OpenAI test recordings with the new normalization (16 recordings).
- Deleted Azure test recordings (85 recordings); these will be regenerated by CI with the new normalization when Azure integration tests run.

Verified locally with the OpenAI setup in replay mode: 16/16 tests pass.

Signed-off-by: roburishabh <roburishabh@outlook.com>
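The normalization described above can be sketched like this (a hedged illustration; the real recorder's function names, placeholder token, and exact hashing scheme are assumptions):

```python
import hashlib
import json
import re

# 8-4-4-4-12 hex digits: the standard UUID shape used for document IDs
UUID_RE = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}",
    re.IGNORECASE,
)


def normalize_for_hash(body: dict) -> str:
    """Replace non-deterministic IDs with a fixed placeholder so a replayed
    request hashes to the same value as the original recording."""
    text = json.dumps(body, sort_keys=True)
    return UUID_RE.sub("<UUID>", text)


def request_hash(body: dict) -> str:
    return hashlib.sha256(normalize_for_hash(body).encode()).hexdigest()
```

With this scheme, two request bodies that differ only in their embedded UUIDs produce the same hash, which is what lets replay mode look up the recording.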
…s-Unresolved-Reviews
Co-Authored-By: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Update test counts in provider matrix to reflect new Azure file_search recordings. Signed-off-by: roburishabh <roburishabh@outlook.com>
Co-Authored-By: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
cdoern left a comment
I really feel like something must be wrong here, since we are adding 157,000 lines, most of which are removing, moving, and re-adding recordings. I don't want to stall this further, but in the future I really encourage PRs with fewer commits that are tested locally before opening the PR, if possible. Thank you!
Are you saying this wasn't tested locally? @RobuRishabh I believe you did test this locally, right?
@franciscojavierarceo I am saying that there were a dozen or so commits running pre-commit and re-pushing for integration test failures, both of which could be run locally with
because most of the recording changes here are not functional: e.g. updates to the OpenAI SDK version.
Yes, I did test this locally. I ran the responses integration tests with `--inference-mode record` initially; then, to minimize the diff, I ran it with
What does this PR do?
Addresses remaining unresolved review comments from PR #4743 (PyPDF File Processor integration) to ensure the file processing pipeline is consistent, correctly typed, and aligned with API expectations.
Key changes:
- Remove the `_legacy_chunk_file` method and all fallback paths from `OpenAIVectorStoreMixin`. The system now raises a clear `RuntimeError` if `file_processor_api` is not configured, instead of silently degrading to legacy inline parsing.
- Wire `file_processor_api` through all vector_io providers: add `Api.file_processors` to `optional_api_dependencies` in the registry, pass it through all 12 factory functions, and accept/forward it in all 9 adapter constructors.
- Make `files_api` required in PyPDF constructors: remove the default `None` from both `PyPDFFileProcessorAdapter` and `PyPDFFileProcessor`, and use `deps[Api.files]` (bracket access) in the factory to fail fast if it is somehow missing.
- Add a size check on the `file_id` retrieval path against `max_file_size_bytes`.
- Handle a missing `file_id`: wrap `openai_retrieve_file` in a try/except that surfaces a `ValueError("File with id '...' not found")`, with a new test covering this case.
- Make the `.strip()` whitespace-only page filter conditional on the `clean_text` config setting.
- Remove the unused `file_processor_api` field from `VectorStoreWithIndex` and the now-unused `make_overlapped_chunks` import from the mixin.

Closes #4743
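The fail-fast and error-surfacing behavior described in the summary might look roughly like this (a sketch only; the class name, method signature, and exact message wording here are assumptions, not the PR's actual code):

```python
class VectorStoreMixinSketch:
    """Illustrative stand-in for the vector store mixin's file path."""

    def __init__(self, file_processor_api=None, files_api=None):
        self.file_processor_api = file_processor_api
        self.files_api = files_api

    async def process_file(self, file_id: str):
        # Fail fast instead of silently falling back to legacy chunking
        if self.file_processor_api is None:
            raise RuntimeError(
                "file_processor_api is not configured; cannot process files"
            )
        try:
            file = await self.files_api.openai_retrieve_file(file_id)
        except Exception as exc:
            # Surface a clear error for a missing file_id
            raise ValueError(f"File with id '{file_id}' not found") from exc
        return await self.file_processor_api.process(file)
```

The design point is that both failure modes (missing processor wiring and missing file) become explicit, typed errors instead of degraded behavior.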
Test Plan
Automated tests
1. Unit tests (mixin + vector_io)
All `test_contextual_retrieval.py` (16 tests) and `test_vector_store_config_registration.py` tests pass; these exercise the refactored `OpenAIVectorStoreMixin`.
2. PyPDF file processor tests (20/20 pass)
uv run --group test pytest -sv tests/integration/file_processors/test_pypdf_processor.py
3. Full integration suite (replay mode)
uv run --group test pytest -sv tests/integration/ --stack-config=starter
Result: 4 failed, 54 passed, 639 skipped, 1 xfailed
All 4 failures are pre-existing and unrelated:
- `test_safety_with_image`: Pydantic schema mismatch (`type: 'image'` vs `'image_url'`)
- `test_starter_distribution_config_loads_and_resolves` / `test_postgres_demo_distribution_config_loads`: relative path `FileNotFoundError`
- `test_mcp_tools_list_with_schemas`: no local MCP server (Connection refused)

No regressions in vector_io, file_search, or ingestion workflows.
Manual E2E verification (with starter distro)
1. Verify route is registered:
Expected:
{ "route": "/v1alpha/file-processors/process", "method": "POST", "provider_types": [ "inline::pypdf" ] }2. Verify OpenAPI contains the endpoint:
3. Direct file upload:
Expected: chunks response with `metadata.processor = "pypdf"`.
4. Via file_id:
Expected: chunks response with `metadata.processor = "pypdf"` and `file_id` in chunk metadata.